# Load API key and secret from environment variables
%load_ext dotenv
%dotenv .env
# System libraries
import glob
import os
import pickle
# ML libraries
import pandas as pd
# ValidMind libraries
import validmind as vm
Time Series Data Validation Demo
Introduction
This notebook demonstrates how to apply data validation tests to time series data using the ValidMind MRM Platform and Developer Framework. For model developers working in the financial sector, ensuring the quality and reliability of time series data is essential for accurate model predictions and robust decision-making.
In this demo, we walk through test suites tailored for time series data and show how they help identify potential issues and inconsistencies. By using the ValidMind MRM platform and developer framework, you can streamline your data validation process and focus on building and refining your models with confidence.
Let’s get started!
Setup
Prepare the environment for the analysis. First, import all necessary libraries and modules. Next, define and configure the specific use case by setting the parameters, data sources, and other settings used throughout the analysis. Finally, establish a connection to the ValidMind MRM platform, which provides a comprehensive suite of tools and services for model validation.
Import Libraries
Use Case Configuration
dataset = 'fred'
if dataset == 'lending_club':
    target_column = ['loan_rate_A']
    feature_columns = ['loan_rate_B', 'loan_rate_C', 'loan_rate_D']
    from validmind.datasets.regression import lending_club
    raw_df = lending_club.load_data()

if dataset == 'fred':
    target_column = ['MORTGAGE30US']
    feature_columns = ['FEDFUNDS', 'GS10', 'UNRATE']
    from validmind.datasets.regression import fred
    raw_df = fred.load_data()

selected_cols = target_column + feature_columns
raw_df = raw_df[selected_cols]
Connect to ValidMind MRM Platform
vm.init(
    api_host = "http://localhost:3000/api/v1/tracking",
    project = "clhhz04x40000wcy6shay2oco"
)
Connected to ValidMind. Project: Customer Churn Model - Initial Validation (clhhz04x40000wcy6shay2oco)
Data Description
display(raw_df)
| DATE | MORTGAGE30US | FEDFUNDS | GS10 | UNRATE |
|---|---|---|---|---|
| 1947-01-01 | NaN | NaN | NaN | NaN |
| 1947-02-01 | NaN | NaN | NaN | NaN |
| 1947-03-01 | NaN | NaN | NaN | NaN |
| 1947-04-01 | NaN | NaN | NaN | NaN |
| 1947-05-01 | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... |
| 2023-04-01 | NaN | NaN | 3.46 | NaN |
| 2023-04-06 | 6.28 | NaN | NaN | NaN |
| 2023-04-13 | 6.27 | NaN | NaN | NaN |
| 2023-04-20 | 6.39 | NaN | NaN | NaN |
| 2023-04-27 | 6.43 | NaN | NaN | NaN |
3551 rows × 4 columns
raw_df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3551 entries, 1947-01-01 to 2023-04-27
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MORTGAGE30US 2718 non-null float64
1 FEDFUNDS 825 non-null float64
2 GS10 841 non-null float64
3 UNRATE 903 non-null float64
dtypes: float64(4)
memory usage: 138.7 KB
Data Preparation
List of Available Test Plans
The vm.test_plans.list_plans() function is part of the ValidMind (vm) library and returns a comprehensive list of available test plans. These test plans are pre-built sets of tests that perform automated data and model validation, covering areas such as data quality, exploratory data analysis, and model performance.
vm.test_plans.list_plans()
| ID | Name | Description |
|---|---|---|
| sklearn_classifier_metrics | SKLearnClassifierMetrics | Test plan for sklearn classifier metrics |
| sklearn_classifier_validation | SKLearnClassifierPerformance | Test plan for sklearn classifier models |
| sklearn_classifier_model_diagnosis | SKLearnClassifierDiagnosis | Test plan for sklearn classifier model diagnosis tests |
| sklearn_classifier | SKLearnClassifier | Test plan for sklearn classifier models that includes both metrics and validation tests |
| tabular_dataset | TabularDataset | Test plan for generic tabular datasets |
| tabular_dataset_description | TabularDatasetDescription | Test plan to extract metadata and descriptive statistics from a tabular dataset |
| tabular_data_quality | TabularDataQuality | Test plan for data quality on tabular datasets |
| normality_test_plan | NormalityTestPlan | Test plan to perform normality tests. |
| autocorrelation_test_plan | AutocorrelationTestPlan | Test plan to perform autocorrelation tests. |
| seasonality_test_plan | SesonalityTestPlan | Test plan to perform seasonality tests. |
| unit_root | UnitRoot | Test plan to perform unit root tests. |
| stationarity_test_plan | StationarityTestPlan | Test plan to perform stationarity tests. |
| timeseries | TimeSeries | Test plan for time series statsmodels that includes both metrics and validation tests |
| time_series_data_quality | TimeSeriesDataQuality | Test plan for data quality on time series datasets |
| time_series_dataset | TimeSeriesDataset | Test plan for time series datasets |
| time_series_univariate | TimeSeriesUnivariate | Test plan to perform time series univariate analysis. |
| time_series_multivariate | TimeSeriesMultivariate | Test plan to perform time series multivariate analysis. |
| time_series_forecast | TimeSeriesForecast | Test plan to perform time series forecast tests. |
| regression_model_performance | RegressionModelPerformance | Test plan for statsmodels regressor models that includes both metrics and validation tests |
Data Quality
Run Data Quality Test Plan
Use the ValidMind (vm) library to perform data quality tests on a time series dataset. The process begins by describing a test plan specifically designed for time series data quality. This test plan contains a set of tests that evaluate the quality of the provided time series data.
Next, the raw DataFrame is used to initialize a dataset using the vm library. This newly created dataset object, vm_dataset, is then utilized for further processing. The test plan parameters are configured to define the z-score threshold for outlier detection and the minimum threshold for identifying missing values.
Finally, the test plan, time_series_data_quality, is executed using the vm.run_test_plan() function with the initialized dataset and the configuration settings provided. This function applies the specified tests to the dataset and generates a report on the quality of the time series data based on the configured parameters.
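The core logic behind the z-score outlier check and the missing-value threshold can be sketched in plain pandas. This is an illustration of the technique, not ValidMind's internal implementation; the function names here are hypothetical:

```python
import pandas as pd

def zscore_outliers(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return observations whose absolute z-score exceeds `threshold`."""
    s = series.dropna()
    z = (s - s.mean()) / s.std()
    return s[z.abs() > threshold]

def missing_value_count(df: pd.DataFrame) -> pd.Series:
    """Count missing values per column, the quantity a min_threshold-style check compares against."""
    return df.isna().sum()

# Toy series: 30 flat monthly values with a single spike
idx = pd.date_range("2020-01-01", periods=30, freq="MS")
vals = [5.0] * 30
vals[10] = 50.0  # spike at 2020-11-01
s = pd.Series(vals, index=idx, name="toy")

outliers = zscore_outliers(s, threshold=3.0)
print(outliers)  # flags only the 2020-11-01 spike
```

Note that with only a handful of observations a single spike cannot exceed a z-score of 3 (the maximum attainable z-score grows with sample size), which is why the toy series uses 30 points.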
vm.test_plans.describe_plan("time_series_data_quality")
| Attribute | Value |
|---|---|
| ID | time_series_data_quality |
| Name | TimeSeriesDataQuality |
| Description | Test plan for data quality on time series datasets |
| Required Context | ['dataset'] |
| Tests | TimeSeriesOutliers (ThresholdTest), TimeSeriesMissingValues (ThresholdTest), TimeSeriesFrequency (ThresholdTest) |
| Test Plans | [] |
vm_dataset = vm.init_dataset(
    dataset=raw_df
)
config = {
    "time_series_outliers": {
        "zscore_threshold": 3,
    },
    "time_series_missing_values": {
        "min_threshold": 2,
    },
}
vm.run_test_plan("time_series_data_quality", dataset=vm_dataset, config=config)
Pandas dataset detected. Initializing VM Dataset instance...
Inferring dataset types...
Running ThresholdTest: time_series_outliers: 0%| | 0/3 [00:00<?, ?it/s]
Variable z-score Threshold Date
0 FEDFUNDS 3.707038 3 1981-05-01
Results for Time Series Data Quality Test Plan:
Logged the following test result to the ValidMind platform:
Logged the following test result to the ValidMind platform:
Logged the following test result to the ValidMind platform:
TimeSeriesDataQuality(test_context=TestContext(dataset=Dataset(raw_dataset= MORTGAGE30US FEDFUNDS GS10 UNRATE
DATE
1947-01-01 NaN NaN NaN NaN
1947-02-01 NaN NaN NaN NaN
1947-03-01 NaN NaN NaN NaN
1947-04-01 NaN NaN NaN NaN
1947-05-01 NaN NaN NaN NaN
... ... ... ... ...
2023-04-01 NaN NaN 3.46 NaN
2023-04-06 6.28 NaN NaN NaN
2023-04-13 6.27 NaN NaN NaN
2023-04-20 6.39 NaN NaN NaN
2023-04-27 6.43 NaN NaN NaN
[3551 rows x 4 columns], fields=[{'id': 'MORTGAGE30US', 'type': 'Numeric'}, {'id': 'FEDFUNDS', 'type': 'Numeric'}, {'id': 'GS10', 'type': 'Numeric'}, {'id': 'UNRATE', 'type': 'Numeric'}], sample=[{'id': 'head', 'data': [{'MORTGAGE30US': nan, 'FEDFUNDS': nan, 'GS10': nan, 'UNRATE': nan}, {'MORTGAGE30US': nan, 'FEDFUNDS': nan, 'GS10': nan, 'UNRATE': nan}, {'MORTGAGE30US': nan, 'FEDFUNDS': nan, 'GS10': nan, 'UNRATE': nan}, {'MORTGAGE30US': nan, 'FEDFUNDS': nan, 'GS10': nan, 'UNRATE': nan}, {'MORTGAGE30US': nan, 'FEDFUNDS': nan, 'GS10': nan, 'UNRATE': nan}]}, {'id': 'tail', 'data': [{'MORTGAGE30US': nan, 'FEDFUNDS': nan, 'GS10': 3.46, 'UNRATE': nan}, {'MORTGAGE30US': 6.28, 'FEDFUNDS': nan, 'GS10': nan, 'UNRATE': nan}, {'MORTGAGE30US': 6.27, 'FEDFUNDS': nan, 'GS10': nan, 'UNRATE': nan}, {'MORTGAGE30US': 6.39, 'FEDFUNDS': nan, 'GS10': nan, 'UNRATE': nan}, {'MORTGAGE30US': 6.43, 'FEDFUNDS': nan, 'GS10': nan, 'UNRATE': nan}]}], shape={'rows': 3551, 'columns': 4}, correlation_matrix=None, correlations=None, type='training', options=None, statistics=None, targets=None, target_column=None, class_labels=None, _Dataset__feature_lookup={}, _Dataset__transformed_df=None), model=None, models=_CountingAttr(counter=41, _default=NOTHING, repr=True, eq=True, order=True, hash=None, init=True, on_setattr=None, alias=None, metadata={}), train_ds=None, test_ds=None, validation_ds=None, y_train_predict=None, y_test_predict=None, context_data=None), config={...})
Handling Frequencies.
def identify_frequencies(df):
    """
    Identify the frequency of each series in the DataFrame.

    :param df: Time-series DataFrame
    :return: DataFrame with two columns: 'Variable' and 'Frequency'
    """
    frequencies = []
    for column in df.columns:
        series = df[column].dropna()
        if not series.empty:
            freq = pd.infer_freq(series.index)
            if freq == 'MS' or freq == 'M':
                label = 'Monthly'
            elif freq == 'Q':
                label = 'Quarterly'
            elif freq == 'A':
                label = 'Yearly'
            else:
                label = freq
        else:
            label = None
        frequencies.append({'Variable': column, 'Frequency': label})

    freq_df = pd.DataFrame(frequencies)
    return freq_df

frequencies = identify_frequencies(raw_df)
display(frequencies)
| | Variable | Frequency |
|---|---|---|
| 0 | MORTGAGE30US | None |
| 1 | FEDFUNDS | Monthly |
| 2 | GS10 | Monthly |
| 3 | UNRATE | Monthly |
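pd.infer_freq inspects the spacing of the index and returns a frequency alias (here 'MS' for month-start), or None when no single frequency fits. That is why MORTGAGE30US comes back None: its raw observations are weekly and do not line up with a monthly grid. A minimal illustration:

```python
import pandas as pd

# A regular month-start index is inferable
monthly_idx = pd.date_range("2020-01-01", periods=6, freq="MS")
print(pd.infer_freq(monthly_idx))  # 'MS'

# A mixed-spacing index has no single inferable frequency
mixed_idx = pd.to_datetime(["2023-04-01", "2023-04-06", "2023-04-13", "2023-04-21"])
print(pd.infer_freq(mixed_idx))  # None
```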
Resample.
preprocessed_df = raw_df.resample('MS').last()
frequencies = identify_frequencies(preprocessed_df)
display(frequencies)
| | Variable | Frequency |
|---|---|---|
| 0 | MORTGAGE30US | Monthly |
| 1 | FEDFUNDS | Monthly |
| 2 | GS10 | Monthly |
| 3 | UNRATE | Monthly |
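The resample('MS').last() call buckets observations into calendar months labeled at month start and keeps the last value in each bucket, which is how the weekly MORTGAGE30US series gets aligned to the monthly series. A minimal sketch on synthetic weekly data:

```python
import pandas as pd

# Four weekly observations spanning two calendar months
weekly = pd.Series(
    [6.1, 6.2, 6.3, 6.4],
    index=pd.to_datetime(["2023-03-10", "2023-03-24", "2023-04-07", "2023-04-21"]),
)

# 'MS' groups by calendar month, labeled at month start; .last() keeps the
# final observation inside each bucket
monthly = weekly.resample("MS").last()
print(monthly)
# 2023-03-01    6.2
# 2023-04-01    6.4
```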
Run Data Quality Test Plan.
vm_dataset = vm.init_dataset(
    dataset=preprocessed_df
)
vm.run_test_plan("time_series_data_quality", dataset=vm_dataset, config=config)
Pandas dataset detected. Initializing VM Dataset instance...
Inferring dataset types...
Running ThresholdTest: time_series_outliers: 0%| | 0/3 [00:00<?, ?it/s]
Variable z-score Threshold Date
0 FEDFUNDS 3.106442 3 1980-03-01
1 FEDFUNDS 3.212296 3 1980-04-01
2 FEDFUNDS 3.537417 3 1980-12-01
3 FEDFUNDS 3.582783 3 1981-01-01
4 FEDFUNDS 3.441645 3 1981-05-01
5 FEDFUNDS 3.587823 3 1981-06-01
6 FEDFUNDS 3.572701 3 1981-07-01
7 FEDFUNDS 3.265222 3 1981-08-01
8 MORTGAGE30US 3.246766 3 1981-09-01
9 MORTGAGE30US 3.271251 3 1981-10-01
10 MORTGAGE30US 3.011098 3 1982-01-01
11 UNRATE 5.011303 3 2020-04-01
12 UNRATE 4.128421 3 2020-05-01
Results for Time Series Data Quality Test Plan:
Logged the following test result to the ValidMind platform:
Logged the following test result to the ValidMind platform:
Logged the following test result to the ValidMind platform:
TimeSeriesDataQuality(test_context=TestContext(dataset=Dataset(raw_dataset= MORTGAGE30US FEDFUNDS GS10 UNRATE
DATE
1947-01-01 NaN NaN NaN NaN
1947-02-01 NaN NaN NaN NaN
1947-03-01 NaN NaN NaN NaN
1947-04-01 NaN NaN NaN NaN
1947-05-01 NaN NaN NaN NaN
... ... ... ... ...
2022-12-01 6.42 4.10 3.62 3.5
2023-01-01 6.13 4.33 3.53 3.4
2023-02-01 6.50 4.57 3.75 3.6
2023-03-01 6.32 4.65 3.66 3.5
2023-04-01 6.43 NaN 3.46 NaN
[916 rows x 4 columns], fields=[{'id': 'MORTGAGE30US', 'type': 'Numeric'}, {'id': 'FEDFUNDS', 'type': 'Numeric'}, {'id': 'GS10', 'type': 'Numeric'}, {'id': 'UNRATE', 'type': 'Numeric'}], sample=[{'id': 'head', 'data': [{'MORTGAGE30US': nan, 'FEDFUNDS': nan, 'GS10': nan, 'UNRATE': nan}, {'MORTGAGE30US': nan, 'FEDFUNDS': nan, 'GS10': nan, 'UNRATE': nan}, {'MORTGAGE30US': nan, 'FEDFUNDS': nan, 'GS10': nan, 'UNRATE': nan}, {'MORTGAGE30US': nan, 'FEDFUNDS': nan, 'GS10': nan, 'UNRATE': nan}, {'MORTGAGE30US': nan, 'FEDFUNDS': nan, 'GS10': nan, 'UNRATE': nan}]}, {'id': 'tail', 'data': [{'MORTGAGE30US': 6.42, 'FEDFUNDS': 4.1, 'GS10': 3.62, 'UNRATE': 3.5}, {'MORTGAGE30US': 6.13, 'FEDFUNDS': 4.33, 'GS10': 3.53, 'UNRATE': 3.4}, {'MORTGAGE30US': 6.5, 'FEDFUNDS': 4.57, 'GS10': 3.75, 'UNRATE': 3.6}, {'MORTGAGE30US': 6.32, 'FEDFUNDS': 4.65, 'GS10': 3.66, 'UNRATE': 3.5}, {'MORTGAGE30US': 6.43, 'FEDFUNDS': nan, 'GS10': 3.46, 'UNRATE': nan}]}], shape={'rows': 916, 'columns': 4}, correlation_matrix=None, correlations=None, type='training', options=None, statistics=None, targets=None, target_column=None, class_labels=None, _Dataset__feature_lookup={}, _Dataset__transformed_df=None), model=None, models=_CountingAttr(counter=41, _default=NOTHING, repr=True, eq=True, order=True, hash=None, init=True, on_setattr=None, alias=None, metadata={}), train_ds=None, test_ds=None, validation_ds=None, y_train_predict=None, y_test_predict=None, context_data=None), config={...})
Remove missing values.
preprocessed_df = preprocessed_df.dropna()
Run Data Quality Test Plan.
vm_dataset = vm.init_dataset(
    dataset=preprocessed_df,
    target_column=target_column
)
vm.run_test_plan("time_series_data_quality", dataset=vm_dataset, config=config)
Pandas dataset detected. Initializing VM Dataset instance...
Inferring dataset types...
Running ThresholdTest: time_series_outliers: 0%| | 0/3 [00:00<?, ?it/s]
Variable z-score Threshold Date
0 FEDFUNDS 3.106442 3 1980-03-01
1 FEDFUNDS 3.212296 3 1980-04-01
2 FEDFUNDS 3.537417 3 1980-12-01
3 FEDFUNDS 3.582783 3 1981-01-01
4 FEDFUNDS 3.441645 3 1981-05-01
5 FEDFUNDS 3.587823 3 1981-06-01
6 FEDFUNDS 3.572701 3 1981-07-01
7 FEDFUNDS 3.265222 3 1981-08-01
8 MORTGAGE30US 3.246766 3 1981-09-01
9 MORTGAGE30US 3.271251 3 1981-10-01
10 MORTGAGE30US 3.011098 3 1982-01-01
11 UNRATE 5.011303 3 2020-04-01
12 UNRATE 4.128421 3 2020-05-01
Results for Time Series Data Quality Test Plan:
Logged the following test result to the ValidMind platform:
Logged the following test result to the ValidMind platform:
Logged the following test result to the ValidMind platform:
TimeSeriesDataQuality(test_context=TestContext(dataset=Dataset(raw_dataset= MORTGAGE30US FEDFUNDS GS10 UNRATE
DATE
1971-04-01 7.29 4.16 5.83 5.9
1971-05-01 7.46 4.63 6.39 5.9
1971-06-01 7.54 4.91 6.52 5.9
1971-07-01 7.69 5.31 6.73 6.0
1971-08-01 7.69 5.57 6.58 6.1
... ... ... ... ...
2022-11-01 6.58 3.78 3.89 3.6
2022-12-01 6.42 4.10 3.62 3.5
2023-01-01 6.13 4.33 3.53 3.4
2023-02-01 6.50 4.57 3.75 3.6
2023-03-01 6.32 4.65 3.66 3.5
[624 rows x 4 columns], fields=[{'id': 'MORTGAGE30US', 'type': 'Numeric'}, {'id': 'FEDFUNDS', 'type': 'Numeric'}, {'id': 'GS10', 'type': 'Numeric'}, {'id': 'UNRATE', 'type': 'Numeric'}], sample=[{'id': 'head', 'data': [{'MORTGAGE30US': 7.29, 'FEDFUNDS': 4.16, 'GS10': 5.83, 'UNRATE': 5.9}, {'MORTGAGE30US': 7.46, 'FEDFUNDS': 4.63, 'GS10': 6.39, 'UNRATE': 5.9}, {'MORTGAGE30US': 7.54, 'FEDFUNDS': 4.91, 'GS10': 6.52, 'UNRATE': 5.9}, {'MORTGAGE30US': 7.69, 'FEDFUNDS': 5.31, 'GS10': 6.73, 'UNRATE': 6.0}, {'MORTGAGE30US': 7.69, 'FEDFUNDS': 5.57, 'GS10': 6.58, 'UNRATE': 6.1}]}, {'id': 'tail', 'data': [{'MORTGAGE30US': 6.58, 'FEDFUNDS': 3.78, 'GS10': 3.89, 'UNRATE': 3.6}, {'MORTGAGE30US': 6.42, 'FEDFUNDS': 4.1, 'GS10': 3.62, 'UNRATE': 3.5}, {'MORTGAGE30US': 6.13, 'FEDFUNDS': 4.33, 'GS10': 3.53, 'UNRATE': 3.4}, {'MORTGAGE30US': 6.5, 'FEDFUNDS': 4.57, 'GS10': 3.75, 'UNRATE': 3.6}, {'MORTGAGE30US': 6.32, 'FEDFUNDS': 4.65, 'GS10': 3.66, 'UNRATE': 3.5}]}], shape={'rows': 624, 'columns': 4}, correlation_matrix=None, correlations=None, type='training', options=None, statistics=None, targets=None, target_column=['MORTGAGE30US'], class_labels=None, _Dataset__feature_lookup={}, _Dataset__transformed_df=None), model=None, models=_CountingAttr(counter=41, _default=NOTHING, repr=True, eq=True, order=True, hash=None, init=True, on_setattr=None, alias=None, metadata={}), train_ds=None, test_ds=None, validation_ds=None, y_train_predict=None, y_test_predict=None, context_data=None), config={...})
Exploratory Data Analysis
Univariate Analysis
Run Time Series Univariate Test Plan
vm.test_plans.describe_plan("time_series_univariate")
| Attribute | Value |
|---|---|
| ID | time_series_univariate |
| Name | TimeSeriesUnivariate |
| Description | Test plan to perform time series univariate analysis. |
| Required Context | ['dataset'] |
| Tests | TimeSeriesLinePlot (Metric), TimeSeriesHistogram (Metric), ACFandPACFPlot (Metric), SeasonalDecompose (Metric), AutoSeasonality (Metric), AutoStationarity (Metric), RollingStatsPlot (Metric), AutoAR (Metric), AutoMA (Metric) |
| Test Plans | [] |
test_plan_config = {
    "time_series_line_plot": {
        "columns": target_column + feature_columns
    },
    "time_series_histogram": {
        "columns": target_column + feature_columns
    },
    "acf_pacf_plot": {
        "columns": target_column + feature_columns
    },
    "auto_ar": {
        "max_ar_order": 3
    },
    "auto_ma": {
        "max_ma_order": 3
    },
    "seasonal_decompose": {
        "seasonal_model": "additive",
        "fig_size": (40, 30)
    },
    "auto_seasonality": {
        "min_period": 1,
        "max_period": 3
    },
    "auto_stationarity": {
        "max_order": 3,
        "threshold": 0.05
    },
    "rolling_stats_plot": {
        "window_size": 12
    },
}
vm_dataset = vm.init_dataset(
    dataset=preprocessed_df
)
vm.run_test_plan("time_series_univariate", config=test_plan_config, dataset=vm_dataset)
Pandas dataset detected. Initializing VM Dataset instance...
Inferring dataset types...
Running Metric: acf_pacf_plot: 22%|██▏ | 2/9 [00:00<00:01, 4.27it/s] The default method 'yw' can produce PACF values outside of the [-1,1] interval. After 0.13, the default will change to unadjusted Yule-Walker ('ywm'). You can use this method now by setting method='ywm'.
Running Metric: seasonal_decompose: 33%|███▎ | 3/9 [00:01<00:02, 2.24it/s] The default method 'yw' can produce PACF values outside of the [-1,1] interval. After 0.13, the default will change to unadjusted Yule-Walker ('ywm'). You can use this method now by setting method='ywm'.
The default method 'yw' can produce PACF values outside of the [-1,1] interval. After 0.13, the default will change to unadjusted Yule-Walker ('ywm'). You can use this method now by setting method='ywm'.
The default method 'yw' can produce PACF values outside of the [-1,1] interval. After 0.13, the default will change to unadjusted Yule-Walker ('ywm'). You can use this method now by setting method='ywm'.
The default method 'yw' can produce PACF values outside of the [-1,1] interval. After 0.13, the default will change to unadjusted Yule-Walker ('ywm'). You can use this method now by setting method='ywm'.
Running Metric: auto_ma: 89%|████████▉ | 8/9 [00:04<00:00, 1.71it/s] Non-invertible starting MA parameters found. Using zeros as starting parameters.
Warning: MORTGAGE30US is not stationary. Results may be inaccurate.
Warning: FEDFUNDS is not stationary. Results may be inaccurate.
Warning: GS10 is not stationary. Results may be inaccurate.
Warning: MORTGAGE30US is not stationary. Results may be inaccurate.
Warning: FEDFUNDS is not stationary. Results may be inaccurate.
Non-invertible starting MA parameters found. Using zeros as starting parameters.
Warning: GS10 is not stationary. Results may be inaccurate.
Non-invertible starting MA parameters found. Using zeros as starting parameters.
Non-invertible starting MA parameters found. Using zeros as starting parameters.
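The non-stationarity flagged in the warnings above is what the RollingStatsPlot metric (configured with window_size=12) makes visible: for a stationary series the rolling mean and standard deviation stay roughly flat, while for a trending series the rolling mean drifts. A standalone sketch of that check with pandas, on synthetic data (illustrative only, not the test plan's internal code):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2000-01-01", periods=120, freq="MS")

# A trending (hence non-stationary) series: linear trend plus small noise
trending = pd.Series(np.linspace(0, 10, 120) + rng.normal(0, 0.1, 120), index=idx)

window = 12  # mirrors the rolling_stats_plot window_size above
rolling_mean = trending.rolling(window).mean()
rolling_std = trending.rolling(window).std()

# For a stationary series this drift would be near zero; here it is large
drift = rolling_mean.iloc[-1] - rolling_mean.dropna().iloc[0]
print(f"rolling-mean drift over the sample: {drift:.2f}")
```

In practice, a visual drift like this suggests differencing or detrending the series before fitting AR/MA models, which is what the AutoStationarity test automates with formal unit-root tests.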